Introduction

This note serves as an overview of some of the algorithms used in Bayesian statistics, starting with an introduction to markov chains and Monte Carlo methods; the two fundamental components that make up a Markov Chain Monte Carlo (MCMC) process.

Random Walk

A random walk pretty much does what it says on the tin. It’s a collection of random steps that, when plotted, represent a random process. An often used example is a particle moving in one dimension. With equal probability the partical moves up or down at time \(t\) relative to it position at time \(t-1\). This can be represented with a Bernoulli probability mass function. Another example would be to let \(\varepsilon ~ \mathcal{N}(0,1)\) and define, \[ X_n = X_{n-1} + \varepsilon \] which gives us a continuous random walk.

The code below computes a discrete random walk in two dimensions with equal probabilities being asigned to each movement: up, down, left, or right.

# RANDOM WALK

loc <- c("up", "down", "left", "right")
dat <- matrix(NA, ncol = 2, nrow = 1e5/2)
dat[1,] <- c(0,0)

for(i in 2:nrow(dat)) {
  rand <- sample(loc, 1, prob = rep(0.25,4))
  if(rand == "up") {
    dat[i,] <- c(0,1) + dat[i-1,]
  }
  else if(rand == "down") {
    dat[i,] <- c(0,-1) + dat[i-1,]
  }
  else if(rand == "left") {
    dat[i,] <- c(1,0) + dat[i-1,]
  }
  else if(rand == "right") {
    dat[i,] <- c(-1,0) + dat[i-1,]
  }
  else {
    dat[i,] <- c(NA,NA)
  }
}

Below, the figure on the left plots the first 5,000 steps of the random walk, and the figure on the right plots all 50,000 steps of the walk with the first 5,000 steps highlighted. The darker regions indicate the parts of the process that walked over itself.

Markov Chains

Consider a sequence of random variables \(X_0, X_1, X_2, \ldots\) where \(X_n\) denotes the state at time \(n \geq 0\). If the sequence is a Markov chain then the the probability of \(X_n\) depends only on \(X_{n-1}\). Thus, we can write the conditional probability as, \[ P(X_{n} = j | X_{n-1} = i) \] where \(j\) and \(i\) are values from the state space. The conditional probability above is the transition probability of moving from state \(i\) in time \(n-1\) to state \(j\) in time \(n\).

The mathematical statement above is best summarized with the Markov property, which states that only the most recent term \(X_{n-1}\) of all the past values \(X_{n-1}, X_{n-2}, X_{n-3}, \ldots, X_0\) influences the prediction of the future state \(X_n\). This also means that \(X_n\) and \(X_{n-2}, X_{n-3}, \ldots, X_0\) are conditionally independent. One useful feature of Markov chains is that it saves us from having to use the entire history to predict the future value.

To simulate a Markov chain we need to compute the probability of moving from one state to the next. These probabilities can be encoded in a transition matrix, the \((i,j)\) element of which defines the probability of moving to state \(j\) from state \(i\) for all \(i,j = 1, 2, \ldots, K\)$.A transition matrix is square since, given the present state, it must describe the probability of moving to each possible state. Additionally, the probabilities of moving to each state are disjoint (e.g. if our states are directions then it’s not possible to both move up and down). As a result the rows of the transition matrix sum to one (i.e. the states represent all the events in the sample space.)

Note that

Now we want to describe how to actually make the move from state \(n-1\) to state \(n\). Let the \(1 \times K\) vector \(\vec{v}_n\) denote the probabilites of the state at \(n\) and let \(M\) denote the \(K \times K\) transition matrix. Then the marginal distribution of each state is given applying the law of total probability in the following way, \[ \vec{v}_{n} = \vec{v}_{n-1} M \] Each element \(j\) of \(\vec{v}_n\), is equal to \(\sum_{i=1}^{K} (\vec{v}_{n-1})_i \cdot M_{i,j}\). In other words the \(j^{th}\) element of \(\vec{v}_n\) is the dot product of \(\vec{v}_{n-1}\) and the \(j^{th}\) column of \(M\).

Intutively this makes sense. We start with a vector of probabilities describing our previous state and a transition matrix describing the move to the next state. Each column of the transition matrix defines a \(K\) vector of probabilities associated with moving to a specific state \(j\) given each of the previous states the chain could be at \(i = 1, 2, \ldots, K\). Then, applying the law of total probability, the sum of the product of the probability of being at each previous state \(i\) and the associated probability of moving to the next state (given by the \((i,j)\) element of the transition matrix) will give the marginal probability of the next state \(j\).

Similar to the random walk above, we simulate two Markov chains moving through two dimensional space with the movements: up, down, left, and right.

The first Markov chain emulates a random walk and uses the following transition matrix, \[ M = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ \end{bmatrix} \]

where each column of \(M\) denotes the state probabilities for moving to state \(j\) at time \(n\) given the probabilities at state \(i \in (\mbox{up, down, left, right})\) at time \(n-1\), and each row \(M\) denotes the probability of moving to each new state \(j \in (\mbox{up, down, left, right})\) given state \(i\). This is a Markov chain as a random walk since equal weight has been put on moving from state \(i\) to state \(j\) for all \(j \in (\mbox{up, down, left, right})\). For example, given that the chain’s previous state was up the probability of moving up is \(M_{1,1} = 0.25\) which is equal to the probability of moving down \(M_{1,2} = 0.25\) (and is also equal to the probability of moving left/right).

The second Markov chain uses the following transition matrix, \[ M = \begin{bmatrix} 0 & 0.9 & 0 & 0.1\\ 0.9 & 0 & 0.1 & 0\\ 0.1 & 0 & 0 & 0.9\\ 0 & 0.1 & 09 & 0\\ \end{bmatrix} \] This type of transition matrix encourages the Markov chain to move in the opposite direction compared to its previous state. For example, if our previous state was moving up then (going across the top row of \(M\)) the next state is up with probability 0, down with probability 0.9, left with probability 0, and right with probability 0.1. From this point we will most likely move down. If so then (going across the second row of \(M\)) we move back up with probability 0.9 and left with probability 0.1. But there is also a 0.1 chance that we could move right after moving up. Then (going across the last row) we would move up with probability 0, down with probability 1, left with probability 0.9, and right with probability 0.

The function that we run the Markov chain through is provided below.

# MARKOV CHAIN FUNCTION

#' Construct a Markov Chain governing movement through 2D space
#' @param transition_matrix A symmetric matrix of probabilities. Rows must sum to one.
#' @param iter Number of steps to take over the 2D space.
#' @param init Initial position in the 2D space.
#' @param state Initial state to start the Markov Chain.
#' @return The probabilities (prob) that are used to calculate the movement (move) and the data (dat) that
#' describes the movemement over 2D space.
markov_chain <- function(transition_matrix, iter = 1e3, init = c(0,0), state = c(1,0,0,0)) {
  # setup containers
  dat <- array(NA, c(iter, 2))  
  prob <- array(NA, c(iter,4))
  move <- array(NA, iter)
  # initial parameterizations
  loc <- colnames(transition_matrix)
  dat[1,] <- init
  # loop to contstruct the chain
  for(i in 2:iter) {
    state <- state%*%transition_matrix
    rand <- sample(loc, 1, prob = state)

    prob[i,] <- c(state)
    move[i] <- c(rand)

    if(rand == "up") {
      dat[i,] <- c(0,1) + dat[i-1,]
    }
    else if(rand == "down") {
      dat[i,] <- c(0,-1) + dat[i-1,]
    }
    else if(rand == "left") {
      dat[i,] <- c(-1,0) + dat[i-1,]
    }
    else if(rand == "right") {
      dat[i,] <- c(1,0) + dat[i-1,]
    }
    else {
      dat[i,] <- c(NA,NA)
    }
  }
  # return containers as a list
  return(list("dat" = dat, "prob" = prob, "move" = move))
}

Now we run the simulations using each transition matrix with the initial condition \(v_0 = (0, 0, 1, 0)\).

# MARKOV CHAIN (as a random walk)
trans_prob <- matrix(0.25, ncol = 4, nrow = 4)
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
iter <- 1e5
mc_rw <- markov_chain(trans_prob, iter = iter)

# MARKOV CHAIN (with a negating transition matrix)
trans_prob <- rbind(c(0, 0.9, 0, 0.1),
                    c(0.9, 0, 0.1, 0),
                    c(0.1, 0, 0, 0.9),
                    c(0, 0.1, 0.9, 0))
colnames(trans_prob) <- row.names(trans_prob) <- c("up", "down", "left", "right")
mc <- markov_chain(trans_prob, iter = iter)

Below are the probabilities and associated move (via sampling from the binomial distribution) of the first 10 steps of the negating Markov chain.

##     x  y        up      down       left      right move
## 1   0  0        NA        NA         NA         NA <NA>
## 2   0 -1 0.0000000 0.9000000 0.00000000 0.10000000 down
## 3   0  0 0.8100000 0.0100000 0.18000000 0.00000000   up
## 4   0 -1 0.0270000 0.7290000 0.00100000 0.24300000 down
## 5   0  0 0.6562000 0.0486000 0.29160000 0.00360000   up
## 6   0 -1 0.0729000 0.5909400 0.00810000 0.32806000 down
## 7   0  0 0.5326560 0.0984160 0.35434800 0.01458000   up
## 8  -1  0 0.1240092 0.4808484 0.02296360 0.37217880 left
## 9  -2  0 0.4350599 0.1488262 0.38304576 0.03306816 left
## 10 -3  0 0.1722481 0.3948607 0.04464396 0.38824718 left

Monte Carlo Methods

A Monte Carlo method is typically used as a way to use random sampling to numerically solve a probablem that is not analytically tractable (e.g. the area under a a curve). The algorithm proceeds as follows,

ALGORITHM (Monte Carlo Method)

  1. Define the domain.
  2. Sample from the domain according to some probability distribution (e.g. uniform)
  3. Perform a deterministic computation using (2)
  4. If (4) is true then accept the inputs, else reject
  5. Aggregate the results.

Estimating \(\pi\)

Using Monte carlo methods we can estimate the mathematical constant \(\pi\) which is equal to ratio of the circumference to the diameter. Consider a unit circle inscribed in a square. The are of the circle is \(\pi\) and the area of the square is 4, so the ratio of the area of the circle and the square is \(\pi/4\). Now all we need to do is construct a Monte Carlo method to estimate this ratio and multiply the result by 4 to get an estimate of \(\pi\).

  1. Define the domain as \(x\in[0,1]\) and the range as \(y\in[0,1]\).
  2. Sample x and y with \(x,y\sim Unif(0,1)\).
  3. Compute \(\alpha = \sqrt{r^2 - x^2}\) where \(r = 1\) is the radius of the circle.
  4. If \(\alpha \leq y\) then accept, else reject.
  5. Compute the ratio \(N_{accept}/(N_{accept} + N_{reject})\), where \(N_{accept}\) and \(N_{reject}\) are the total number of accepted and rejected values, resepectively.
  6. Multiply (6) by 4 to get an estimate of \(\pi\).

Below is the code that runs through the steps above to estimate \(\pi\).

circle <- function(x) {
  sqrt(1 - x^2)
}

domain <- seq(0,1, length.out = 1e3)
iter <- 1e4 * 2
accept <- matrix(NA, ncol = 2, nrow = iter)
reject <- matrix(NA, ncol = 2, nrow = iter)
for(i in 1:iter) {
  x <- runif(1, 0, 1)
  y <- runif(1, 0, 1)
  if(x^2 + y^2 <= 1)
    accept[i,] <- c(x,y)
  else
    reject[i,] <- c(x,y)
}
cat('estimate of pi = ', (nrow(na.omit(accept)) / iter) * 4)
## estimate of pi =  3.1214

The figure below illustrates the accepted and rejected values.

Accept-Reject Algorithm

The accept-reject algorithm is a Monte Carlo method that enables you to sample from a probability distribution. Let \(f\) be the target distribution (i.e. the distribution you want to get samples from) and let \(g\) be the candidate distribution (i.e. the distribution you use to randomly sample out of the domain of interest). We need these distributions to statisfy two conditions. First, both \(g\) and \(f\) need to be defined over the same support. Second, from some value in the support \(f(x)/g(x) \leq M\) for some constant \(M\). The algorithm proceeds as follows,

ALGORITHM (Rejection Sampling)

  1. Generate \(y\) from the proposal distribution \(g\).
  2. Calculate the likelihood ratio \(\alpha = f(y)/(Mg(y))\) where \(M\) is a bound on the likelihood ratio.
  3. Generate a stochastic condition \(u ~ Unif(0,1)\).
  4. If \(\alpha > u\) accept \(y\), else reject.
# MONTE CARLO - REJECTION SAMPLING

#' A Rejection Sampling algorithm where proposals are generated from Unif(0,1).
#' @param f Target distribution.
#' @param g Candidate distribution.
#' @param support Support of the target distribution.
#' @param M Bound on the likelihood ratio f/g.
#' @param iter Number of iterations to run the algorithm through (iter = accept + reject).
#' @return A list containing accepted/rejected values.
accept_reject <- function(f, g, support = c(0,1), M = 1, iter = 1e3) {
  out <- list()

  for (i in 1:iter) {
    # containers
    y <- runif(1, support[1], support[2])  # proposal
    u <- runif(1, 0, 1)                    # stochastic condition
    out$y[i] <- y
    out$u[i] <- u
    # likelihood ratio
    ratio <- f(y)/(M*g(y, support[1], support[2]))
    # accept-reject condition
    if(ratio > u)
      out$accept[i] <- y
    else
      out$reject[i] <- y
  }
  return(out)
}

The code below applies rejection sampling to sample from the beta distribution using different values of \(M\).

beta_sample1 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
                             g = function(x, a, b){dunif(x, a, b)},
                             M = 1, iter = 1e4*4)
beta_sample2 <- accept_reject(f = function(x){dbeta(x, 2, 2)},
                              g = function(x, a, b){dunif(x, a, b)},
                              M = 2, iter = 1e4*4)

The table below describes the total number of accepted (TRUE) and rejected (FALSE) values in each sampling procedure.

##         M=1   M=2
## FALSE  7674 20049
## TRUE  32320 19951

Notice that with a low value of \(M\) we ended up accepting too many values. In fact, what ended up happening is that we over accepted where \(f\) has more mass. This is apparent in the figure below which plots the accepted and rejected values. When \(M=1\), the empirical beta distribution getting squashed around its expected value. With a more adequate value for \(M\) (such as \(M=2\)) we get a better approximation of the true beta distribution. (For comparison purposes, the figure on the far left plots samples from the \(Beta(2,2)\) distribution using the R function rbeta().)

If we state our acceptance rule as \(u \leq f(y)/(Mg(y)) \implies u(Mg(y)) \leq f(y)\) then why this is happening should be clear. If \(M_1 < M_2\) then for any given \(u\) and \(y\), \(u(M_1g(y)) < u(M_2g(y))\) which means that the condition \(u(M_1g(y)) \leq f(y)\) will end up accepting \(y\) more than \(u(M_2g(y)) \leq f(y)\).

Metropolis Algorithm

Now we want to combine these two aforementioned ideas of Markov chains and Monte Carlo methods. If we use a Markov chain to move through some mathematical space and sample according to some Monte Carlo method then what we have is called Markov chain Monte Carlo (MCMC). The Metropolis algorithm is a simple case of this. Again, we have a target distribution \(f\) that we are interested in sampling from and a symmetric candidate distribution \(g\) that we generate proposals from. Let the index \(n\) denote the current position of the chain. The algorithm proceeds as follows,

ALGORITHM (Metropolis)

  1. Generate \(x^* \sim g(x_n)\)
  2. Compute the acceptance ratio \(\alpha = f(x^*)/f(x_n)\).
  3. Accept alpha with probability \(min(1,\alpha)\) (i.e. if \(\alpha < 1\) then we will need to compare it with a stochastic criterion \(u \sim Unif(0,1)\) in order to determine whether we accept/reject the proposal.)
  4. If \(x^*\) is accepted then \(x_{n+1} = x^*\), else \(x_{n+1} = x_{n}\).

The code below creates a function that runs the Metropolis algorithm. The user can declare the number of iterations and the number of chains. Both are important inputs. The number of iterations ensures that we have a large enough sample from the distribution, and running chains with different initial values allows us to determine whether the samples converge. If they do not converge then we cannot perform any inferences on the sample since it does not adequately represent the distribution. Sometimes this can be remedied by increasing the number of iterations. In this

# METROPOLIS ALGORITHM

# 1. generate x* from g(x_{n}) (where g is a symmetric candidate distribution, e.g. normal)
# 2. compute the acceptance ratio alpha = f(x*)/f(x_{n}) (where f is the target distribution)
# 3. accept x* with probability min(1, alpha)
# 4. if x* is accepted then collect x_{n+1} = x* else collect x_{n+1} = x_{n}
# 5. repeat

metropolis <- function(f, iter = 1e3, chains = 4, prop_scale = 1) {
  init <- rnorm(chains, 0, 20)
  out <- array(NA, c(iter, chains))
  out[1,] <- init

  # define a function to sample (i.e. accpet/reject)
  sampler <- function(x) {
    alpha <- 0
    u <- 1
    while(alpha < u) {
      u <- runif(1, 0, 1)                   # stochastic criterion
      x_star <- rnorm(1, x, 1 * prop_scale) # proposal distribution
      alpha <- exp(f(x_star)-f(x))          # acceptance ratio
    }
    return(x_star)
  }

  # run the sampler
  for(i in 2:iter) {
    for(j in 1:chains) {
      out[i,j] <- sampler(out[i-1,j])       # accepting x* or x_n
    }
  }

  out
}

Before running the algorithm we need to define the posterior distribution that we want to sample from which is done in the code below.

dat <- rnorm(1, rnorm(10, 5, 5), 1)
post_func <- function(x) {
  lik <- sum(dnorm(dat, x, 1, log = T))
  prior <- dnorm(x, 5, 1, log = T)
  lik + prior
}

Mathematically we can state our model as, \[ y \sim N(\mu, 1) \\ \mu \sim N(5,1) \]

We run the algorithm and look at the median of the data and the median of the samples of \(\mu\).

metro <- metropolis(f = post_func, chains = 4, iter = 2e3)
rbind("median of data" =  median(dat),
      "median of met samples" =  median(metro))
##                            [,1]
## median of data        11.100220
## median of met samples  8.046592

It looks like we got closer to the true \(\mu\) compared to the maximum likelihood estimate of \(\mu\) (which would have been approximately equal to the mean of the data).

In the figure below the plot on the top left illustrates the full length of all four chains. Although the chains start in disparate locations, they converge pretty quickly. This is not always the case, especially in complicated models. The initial steps are useful only insofar as they eventually take us to most of the mass of the distribution. We should discard most of these initial steps since they are not really exploring much of the mass distribution (the part that we find interesting). It is conventionally to regard the first half of each chain as a warmup (or burnin) for the algorithm and conduct inferences on the last half. Next, the plot on the top right presents the first 100 iteration of each chain and we can see that the chains converge within the first 100 iterations. The plot on the bottom left illustrates the last half of each chain. Finally, the plot on bottom right represents a histogram of the sampled values of \(\mu\).

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is similar to the Metropolis algorithm, however, we now relax the assumption that the proposal distribution \(g\) has to be symmetric. The algorithm proceeds in the same was except that now our accept/reject condition is \(u < \frac{f(x^*)}{f(x_n)}\frac{g(x_n|x^*)}{g(x^*|x_n)}\). The component \(frac{g(x_n|x^*)}{g(x^*|x_n)}\) is included in the condition to correct for the asymmetry in the proposal distribution. The benefit of this is efficiency since, in some cases, using an asymmetric proposal distribution can increase the speed of the Markov chain.

The code below creates a Metropolis-Hastings function that allows you to sample from a bivariate distribution, where \(g\) is the bivariate normal distribution.

metropolis_hastings <- function(f, iter = 1e3, chains = 4, init = NULL, prop_scale = 1, prop_loc = c(0,0)) {
  out <- array(NA, c(iter, 2, chains))
  # setup initial values
  for(p in 1:2) {
    if(is.null(init))
      out[1,p,] <- rnorm(chains, 0, 1)
    else
      out[1,p,] <- init[,p]
  }

  prop_covmat <- diag(1,2)*prop_scale

  # create sampler
  sampler <- function(x) {
    alpha <- 0
    u <- 1
    reject <- -1
    while(alpha < u) {
      reject <- reject + 1
      u <- runif(1, 0, 1)
      x_star <- rmvnorm(1, x, prop_covmat)
      alpha <- exp((f(x_star)-f(x)) +
                     (dmvnorm(x, x_star, prop_covmat, log = TRUE) - dmvnorm(x_star, x, prop_covmat, log = TRUE)))
      # cat("alpha = ", alpha, "; ", "u = ", u, "\n")
    }
    return(list("post" = x_star, "reject" = reject))
  }

  # run sampler for each chain
  for(j in 1:chains) {
    cat(":: chain", j, "::","")
    reject <- 0
    for(i in 2:iter) {
      sampler_stuff <- sampler(out[i-1,,j])
      reject <- reject + sampler_stuff$reject
      out[i,,j] <- sampler_stuff$post  # important! the rest is just print aesthetics
      if(i%%(iter/10) == 0)
        cat(".")
      if(i == iter)
        cat("", "accept rate =", iter,"/", reject + iter, "=", round(iter/(reject+iter),4), "\n")

    }
  }

  out
}

The code below creates the data and runs the Metropolis-Hastings algorithm. Our posterior distribution that we are interested in sampling from can be stated as, \[ y \sim \mathcal{MN}(\vec{\mu}, \mathbf{\Sigma}) \\ \vec{\mu}_1 \sim \mathcal{N}(0, 0.1) \\ \vec{\mu}_2 \sim \mathcal{N}(0, 0.1) \\ \]

Where \(\Sigma\) is a diagonal matrix with ones on the diagonal.

library(mvtnorm)

# parameters and data
covmat <- diag(1,2)
N <- 5
priors <- cbind(rnorm(N, 0, 3), rnorm(N, 0, 3))
dat <- array(NA, c(N, 2))
for(i in 1:N) {
  dat[i,] <- rmvnorm(1, mean = priors[i,], sigma = covmat)
}

# initial values for algorithm
init_val <- rbind(c(1,1),
                  c(-1,-1),
                  c(1,-1),
                  c(-1,1))
init_val <- init_val * 4

# posterior distribution
posterior_func <- function(pars) {
  lik <- sum(dmvnorm(dat, pars, sigma = covmat, log = TRUE))
  prior <- dnorm(pars, 0, 0.1, log = TRUE)
  lik + sum(prior)
}

# run algorithm
metro_hasti <- metropolis_hastings(f = posterior_func,
                                   iter = 2e3, chains = 4, init = init_val,
                                   prop_scale = 0.005, prop_loc = colMeans(dat))
## :: chain 1 :: .......... accept rate = 2000 / 3059 = 0.6538 
## :: chain 2 :: .......... accept rate = 2000 / 3040 = 0.6579 
## :: chain 3 :: .......... accept rate = 2000 / 3124 = 0.6402 
## :: chain 4 :: .......... accept rate = 2000 / 3065 = 0.6525

Below we have the means of each parameter using the data and the sampled values. Again, we get a lot closer to the true parameter values with the Bayesian approach, compared to the maximum likelihood approach.

##                        mu1         mu2
## mean data      -0.88561791 -0.72070321
## mean posterior -0.04652705 -0.03576534

Now we have examine the convergence of the chains. The plot below looks at convergence in the two dimensional space. (We used starting values in each corner of a square for illustrative purposes.)

The figure below illustrates the traceplots for each chain with the warmup shaded in grey.

The figure below illustrates the samples for each parameter using histograms.

Tuning

The Metropolis and Metropolis-Hasting algorithms are sensitive to tuning. Above we used a fairly inefficient tuning parameter prop_scale = 0.005 (this is apparent in the large number of accepted proposals). It is inefficient in the sense that it over accepted proposals out in the tails. This isn’t a huge problem it we run a large number of iterations, but that is not always feasible.

metro_hasti_tuned <- metropolis_hastings(f = posterior_func,
                                   iter = 2e3, chains = 4, init = init_val,
                                   prop_scale = 0.1, prop_loc = colMeans(dat))
## :: chain 1 :: .......... accept rate = 2000 / 13018 = 0.1536 
## :: chain 2 :: .......... accept rate = 2000 / 13263 = 0.1508 
## :: chain 3 :: .......... accept rate = 2000 / 13362 = 0.1497 
## :: chain 4 :: .......... accept rate = 2000 / 13371 = 0.1496

If we plot the first 100 iterations, notice that the chains with the inefficient tuning parameter did not even get a chance to mix under the center mass of the distribution. However, with better tuning there is already a lot of mixing taking place.

Hamiltonian Monte Carlo Algorithm

ALGORITHM

Aside: Gibbs Sampling

Gibbs sampling requires that your priors are conjugate priors. This means that your prior distribution and posterior distribution belong to the same family of distributions. This involves specifying the kernel of \(p(\theta|y)\), recognizing what distribution this kernel represents, and then drawing from this distribution. This method of sampling is not used so much given the restriction of using conjugate priors and that the transition of the sampler moves at right angles across the distribution.

library(mvtnorm)
y <- rmvnorm(100, c(0, 3))
gibbs <- function(y, y_0, iter, rho = 0.5) {
  out <- array(iter, ncol(y))
  out <- unif()
  for(i in 1:iter) {
    mu1 <- rnorm(1, mu2)
    mu2 <- rnorm(1, mu1)
    out(i,) <- cbind(mu1,mu2)
  }
}